Vulture: cloud-enabled scalable mining of microbial reads in public scRNA-seq data

Abstract The rapidly growing collection of public single-cell sequencing data has become a valuable resource for molecular, cellular, and microbial discovery. Previous studies mostly overlooked detecting pathogens in human single-cell sequencing data. Moreover, existing bioinformatics tools lack the scalability to deal with big public data. We introduce Vulture, a scalable cloud-based pipeline that performs microbial calling for single-cell RNA sequencing (scRNA-seq) data, enabling meta-analysis of host–microbial studies from the public domain. In our benchmarking experiments, Vulture is 66% to 88% faster than local tools (PathogenTrack and Venus) and 41% faster than the state-of-the-art cloud-based tool Cumulus, while achieving comparable microbial read identification. In terms of the cost on cloud computing systems, Vulture also shows a cost reduction of 83% ($12 vs. $70). We applied Vulture to 2 coronavirus disease 2019, 3 hepatocellular carcinoma (HCC), and 2 gastric cancer human patient cohorts with public sequencing reads data from scRNA-seq experiments and discovered cell type–specific enrichment of severe acute respiratory syndrome coronavirus 2, hepatitis B virus (HBV), and Helicobacter pylori–positive cells, respectively. In the HCC analysis, all cohorts showed hepatocyte-only enrichment of HBV, with cell subtype-associated HBV enrichment based on inferred copy number variations. In summary, Vulture presents a scalable and economical framework to mine unknown host–microbial interactions from large-scale public scRNA-seq data. Vulture is available via an open-source license at https://github.com/holab-hku/Vulture.


Background
Pathogen-associated diseases are a significant threat to global health; examples include severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in coronavirus disease 2019 (COVID-19), hepatitis B virus (HBV) and hepatitis C virus (HCV) in hepatocellular carcinoma (HCC) [1], and Helicobacter pylori in gastric cancer (GC) [2]. Single-cell or single-nucleus RNA sequencing (sc/snRNA-seq) has transformed the investigation of complex diseases and contributed to discoveries of host-microbial interaction mechanisms [3][4][5][6][7]. Due to the rapidly maturing scRNA-seq technologies, the exponentially growing public scRNA-seq data resources have become a gold mine for conducting in silico investigations of host-microbial interactions.
In the current practice of scRNA-seq data processing, a key concern is the selection of the reference genomes when quantifying the reads. Most studies only align reads to the host genome or focus on limited microbial genomes [8][9][10][11][12]. This practice systematically risks missing both known and unknown host-microbial interactions in the datasets. It is therefore worthwhile to perform reanalyses of existing public scRNA-seq data on the cloud to uncover the breadth of these interactions. According to the Human Cell Atlas [13] Data Portal, as of December 2022, there are an estimated 12.3 M cells from 2,400 specimens, totaling 38.1 TB in file size of published human cellular droplet-based scRNA-seq data. As more and more files become available, cloud computing becomes increasingly enticing as the choice for performing large-scale reanalyses that can leverage huge amounts of computational resources without the need to purchase or maintain expensive hardware and that avoid the transfer of large amounts of data.
Several tools have been developed for the identification of microbial reads in human scRNA-seq data on local machines. Viral-Track [14] is an existing computational pipeline that detects viral-host interactions in droplet scRNA-seq data by scanning host-unmapped reads for the presence of viral RNA. Based on a similar schema, Zhang et al. and Lee et al. developed PathogenTrack [15] and Venus [16], which have added capabilities. PathogenTrack can quantify bacteria in addition to viruses; in benchmarks, its called microbial unique molecular identifiers (UMIs) were mostly correlated with those of Viral-Track, and its runtime was shorter [15]. Venus identifies viruses only but has another module to discover viral integration sites. However, due to the number of steps and certain tool choices in these pipelines, their scalability can still be improved.
These off-the-shelf microbial calling methods for scRNA-seq are developed as command-line tools running on local computing environments, which may eventually struggle with the scale of published data on the cloud. Only with cloud computing can we obtain the scRNA-seq big data as well as leverage a huge amount of computational resources without the maintenance of expensive devices. Previously, we [17] developed the cloud-based Falco framework for scalable scRNA-seq analysis. On 2 public scRNA-seq datasets, it was 2.6 to 145.4 times faster than running on the local computing environments. Li et al. [18] subsequently developed Cumulus, a scalable scRNA-seq analysis framework based on the Terra platform. With Cumulus, Delorey et al. [19] performed COVID-19 scRNA-seq dataset analysis on 420 specimens from 11 organs in 2021. In 2022, Edgar et al. [20] developed the Serratus framework. They screened 5.7 million transcriptome sequencing (RNA-seq) datasets for RNA-dependent RNA polymerases and identified more than 10^5 novel RNA viruses. However, Falco has only supported the Smart-Seq protocol since 2017, which is insufficient nowadays. Also, neither the Cumulus nor the Serratus pipeline focuses on host-microbial scRNA-seq analysis.
To perform a large-scale meta-analysis of public scRNA-seq data, we developed Vulture, which to our knowledge is the first cloud-based scalable framework for discovering microbial reads in public scRNA-seq data. It can be executed either on cloud container services in parallel or in a local environment. Our tool provides an easily modifiable, host-microbial combined reference that standardizes the gene transcript annotations of human and known human-host viruses and bacteria. Additional features of Vulture are the support of multiple formats of raw sequencing file inputs and the quality control metrics of the identified intracellular microbial UMIs. We benchmarked the scalability and cost-effectiveness of our tool and show that it outperforms existing solutions. With Vulture, we reanalyzed cohorts of COVID-19, hepatocellular carcinoma, and gastric cancer with public raw sequencing data of droplet scRNA-seq and examined the host-microbial interactions of SARS-CoV-2, HBV, and H. pylori, respectively. Specifically, we detected an upregulation of chemokine receptor crosstalk along with the coinfection of SARS-CoV-2 and human metapneumovirus (hMPV) from a COVID-19 bronchoalveolar lavage fluid (BALF) sample and potential HBV-induced copy number variations from an HCC sample. These results show the utility of extending viral calling to the full set of known host microbes.

Cloud infrastructure of Vulture on AWS batch infrastructure with Nextflow
The cloud framework of Vulture is described in Supplementary Fig. S1a. Vulture applications on the cloud are constructed in Docker containers. A container is a lightweight software unit that packages all our procedures and dependencies for sequence alignment, quality control, and downstream analysis. Containerized Vulture applications are managed by the Amazon Web Services (AWS) Batch service. AWS Batch is a batch management capability to efficiently run a huge amount of batch computing jobs on AWS. Batch is a job scheduler composed of 4 elements: Compute Environments, Job Queues, Job Definitions, and Jobs. The Compute Environment specifies the computational resources required for a type of task. We applied the SPOT_CAPACITY_OPTIMIZED allocation strategy in Batch to prioritize the use of spot instances in the Compute Environment. The Job Queue maps the Vulture pipeline task to 1 or more Compute Environments. The Job Definition is a template that assigns the Docker image to be employed in running a particular task along with its parameters such as the number of CPUs, the amount of memory, and other configurations. A Job binds a Job Definition to a specific Job Queue and executes the task command in the Docker container. In the Vulture pipeline, Job Definitions and the execution of Jobs are controlled by Nextflow, a language that streamlines the deployment of workflows on commercial clouds and clusters. Nextflow creates the required Job Definitions and Jobs as needed. Each Job can use a different queue and Docker image. The Vulture container is published in DockerHub and Elastic Container Registry (ECR), both of which are accessible from the instances run by Batch. The Simple Storage Service (S3) bucket is where the input, output, and working directory of the Vulture pipeline are stored during execution.

Construction of host-microbe combined reference genome
The first step of Vulture is to construct reference genomes and corresponding annotations for the host (human, in this study) and host-infecting viruses and bacteria. We use 245 distinct human-host prokaryotes curated by the NCBI Genome [21] and 529 human-host viral species from viruSITE [22], which together with the human reference genome hg38 form a combined reference set. The set is a collation of the reference genome fasta sequences and exon/transcript/gene gtf annotations of all species used. Nonhost exons with a minimap2 (RRID:SCR_018550) [23] alignment to the host genome were removed due to ambiguity. The combined host-microbe reference genome was indexed using the genomeGenerate module from the STARsolo (RRID:SCR_021542) [24] tool.
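The collation step can be illustrated with a minimal Python sketch. It is not the Vulture implementation: the function names, the in-memory string inputs, and the choice to prefix microbial sequence names with a species label (to keep contig names unique) are assumptions made for illustration, and the minimap2-based removal of ambiguous nonhost exons is not shown.

```python
from typing import Dict, Tuple

Reference = Tuple[str, str]  # (FASTA text, GTF text)


def prefix_fasta(fasta_text: str, prefix: str) -> str:
    """Prepend `prefix` to every sequence name in a FASTA string."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            out.append(">" + prefix + line[1:])
        elif line:
            out.append(line)
    return "\n".join(out)


def prefix_gtf(gtf_text: str, prefix: str) -> str:
    """Prepend `prefix` to the seqname column of every GTF record."""
    out = []
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        fields[0] = prefix + fields[0]
        out.append("\t".join(fields))
    return "\n".join(out)


def combine_references(host: Reference, microbes: Dict[str, Reference]) -> Reference:
    """Collate host and microbial FASTA/GTF text into one combined reference.

    Hypothetical convention: microbial contigs are renamed "<species>_<contig>".
    """
    fasta_parts = [prefix_fasta(host[0], "")]
    gtf_parts = [prefix_gtf(host[1], "")]
    for name, (fasta_text, gtf_text) in microbes.items():
        fasta_parts.append(prefix_fasta(fasta_text, name + "_"))
        gtf_parts.append(prefix_gtf(gtf_text, name + "_"))
    fasta = "\n".join(p for p in fasta_parts if p) + "\n"
    gtf = "\n".join(p for p in gtf_parts if p) + "\n"
    return fasta, gtf


# Toy records, invented for illustration only.
host_ref = (
    ">chr1\nACGT\n",
    "\t".join(["chr1", "demo", "exon", "1", "4", ".", "+", ".", 'gene_id "HOST1";']) + "\n",
)
microbe_refs = {
    "virusA": (
        ">NC_000001\nGGCC\n",
        "\t".join(["NC_000001", "demo", "exon", "1", "4", ".", "+", ".", 'gene_id "V1";']) + "\n",
    )
}
combined_fasta, combined_gtf = combine_references(host_ref, microbe_refs)
```

The combined FASTA and GTF text would then be written to disk and passed to the aligner's indexing step.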

Quantifying reads from scRNA-seq data to count matrices
Vulture supports a variety of alignment algorithms to quantify scRNA-seq sequences with the constructed combined reference genome. Users can select STARsolo [24] (default), Cell Ranger [25], Kallisto | bustools [26], or Alevin [27]. Sequence data are quantified as a 2-dimensional UMI count matrix of cells × genes and the corresponding Binary Alignment Map (BAM) file. BAM files are only generated if STARsolo or Cell Ranger is selected.

Quality control of the mapped microbial reads and count matrix
After obtaining the results of the sequence alignment, we also perform additional quality control steps to increase the likelihood that the viral sequences found are intracellular and reliable. Vulture utilizes the EmptyDrops [28] algorithm to filter out the droplets with noncellular ambient RNA. We then optionally perform various quality analyses on the BAM files documenting the sequence alignments, including multimapping of the resulting sequences to host or nonhost genes and reads dispersion, which is the extent of unique positions to which the reads are aligned on a given transcript.
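One possible reading of the reads dispersion metric is sketched below in Python. The exact definition used by Vulture is not given in the text, so the choice here (number of distinct alignment start positions per transcript, and that number divided by the read count) is an assumption, as are the function name and the tuple-based input that stands in for records parsed from a BAM file.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def reads_dispersion(alignments: Iterable[Tuple[str, int]]) -> Dict[str, Tuple[int, float]]:
    """Summarize how widely reads spread over each transcript.

    `alignments` yields (transcript_id, alignment_start) pairs.
    Returns, per transcript, (number of distinct start positions,
    distinct start positions / number of reads). A low ratio means
    the reads pile up on very few positions.
    """
    starts: Dict[str, List[int]] = defaultdict(list)
    for transcript_id, start in alignments:
        starts[transcript_id].append(start)
    return {
        transcript_id: (len(set(positions)), len(set(positions)) / len(positions))
        for transcript_id, positions in starts.items()
    }


# Toy alignments, invented for illustration only.
toy_alignments = [
    ("viral_gene_S", 100), ("viral_gene_S", 100),
    ("viral_gene_S", 250), ("viral_gene_S", 400),
    ("viral_gene_N", 10), ("viral_gene_N", 10),
]
dispersion = reads_dispersion(toy_alignments)
```

Under this reading, a transcript whose reads all start at the same position would be treated with more suspicion than one covered at many positions.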

Downstream analysis of scRNA-seq samples
The meta-analysis of the Vulture-processed results is composed of several procedures. We applied Seurat (RRID:SCR_007322) [29] to the COVID-19, HCC, and GC samples listed in Table 1 to perform scRNA-seq processing and clustering. To remove batch effects across different cohorts, we applied Harmony [30] to the COVID-19, HCC, and GC samples. CellChat (RRID:SCR_021946) [31] was applied to calculate the ligand-receptor interactions among the annotated cell types. Copy number variation (CNV) inference and clone identification for the HCC sample were performed with the inferCNV (RRID:SCR_021140) [32] package.

Cell-type enrichment of microbial UMI
We follow the idea in [19] to calculate the cell type-specific enrichment score of intracellular microbes, because the number of microbial UMIs is small and their differences across cell types are difficult to observe directly. The enrichment score for cluster C is computed from the following quantities: N_VC is the number of microbe-positive cells in cluster C, N_V is the number of microbe-positive cells in the whole cohort, P_C is the proportion of the total number of cells in cluster C out of the total number of cells in the cohort, and ε is a small float that guards against zero values.
The P value of cell type-specific enrichments of intracellular microbes was calculated by randomly permuting the identical number of microbe-positive annotations across all cell types 10,000 times. The empirical P value is the proportion of the 10,000 permutations that yield an enrichment score not less than the actual score in the cohort. We also perform false discovery rate (FDR) correction on the empirical P values.
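The enrichment test can be sketched in Python as follows. The score formula used here, log((N_VC + ε) / (N_V × P_C + ε)), is an assumed observed-over-expected form built from the quantities defined above; it should be checked against [19] and the Vulture source before reuse. The function names, the default ε, and the use of the Benjamini-Hochberg procedure for the FDR step are likewise assumptions.

```python
import math
import random
from collections import Counter
from typing import Dict, List, Sequence


def enrichment_scores(labels: Sequence[str], positive: Sequence[bool],
                      eps: float = 1e-4) -> Dict[str, float]:
    """Assumed score per cluster C: log((N_VC + eps) / (N_V * P_C + eps))."""
    n_cells = len(labels)
    n_v = sum(positive)  # microbe-positive cells in the cohort
    cluster_sizes = Counter(labels)
    n_vc = Counter(lab for lab, pos in zip(labels, positive) if pos)
    return {
        c: math.log((n_vc[c] + eps) / (n_v * size / n_cells + eps))
        for c, size in cluster_sizes.items()
    }


def permutation_p_values(labels: Sequence[str], positive: Sequence[bool],
                         n_perm: int = 10_000, eps: float = 1e-4,
                         seed: int = 0) -> Dict[str, float]:
    """Empirical P value: share of permutations scoring >= the observed score."""
    rng = random.Random(seed)
    observed = enrichment_scores(labels, positive, eps)
    hits = dict.fromkeys(observed, 0)
    shuffled: List[bool] = list(positive)
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign the same number of positives at random
        permuted = enrichment_scores(labels, shuffled, eps)
        for c in observed:
            if permuted[c] >= observed[c]:
                hits[c] += 1
    return {c: hits[c] / n_perm for c in observed}


def benjamini_hochberg(p_values: Dict[str, float]) -> Dict[str, float]:
    """Benjamini-Hochberg adjusted P values (one common FDR correction)."""
    ranked = sorted(p_values.items(), key=lambda kv: kv[1])
    m = len(ranked)
    adjusted: Dict[str, float] = {}
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value down
        key, p = ranked[rank - 1]
        running_min = min(running_min, p * m / rank)
        adjusted[key] = running_min
    return adjusted


# Toy cohort, invented for illustration: 20 positive cells, all hepatocytes.
toy_labels = ["hepatocyte"] * 50 + ["T cell"] * 150
toy_positive = [True] * 20 + [False] * 180
scores = enrichment_scores(toy_labels, toy_positive)
p_vals = permutation_p_values(toy_labels, toy_positive, n_perm=1000)
q_vals = benjamini_hochberg(p_vals)
```

The toy run uses 1,000 permutations to stay fast; the text specifies 10,000.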

Vulture: a cloud-based microbial calling framework for public scRNA-seq data
The architecture of Vulture is shown in Fig. 1. Vulture is composed of a bioinformatics analysis container, a cloud platform, and a workflow management tool. The container defines 5 main processes for performing microbe calling for sc/snRNA-seq data: sequence data retrieval, human-microbe combined reference construction (optional), reads alignment, quality control, and downstream analysis (optional). Detailed implementations are listed in the Methods section and Supplementary Fig. S1a. Vulture receives 2 major inputs by default: (i) the sequencing files and (ii) microbe genome files. The input sequencing files can be a set of run accession numbers (prefixed by SRR) from the Sequence Read Archive (SRA) or a set of Amazon S3 or HTTP download URLs. Both fastq and bam files are supported. As for the input of the combined reference, we provide a default host-microbe reference covering human and all human-host microbe genomes. Users can also build their custom combined genome by inputting a list of microbe genome accession numbers based on viruSITE or NCBI.
Vulture utilizes cloud computing to provide a fast, scalable, and cost-effective viral calling framework without the need for hardware maintenance. Vulture is built natively on the AWS Batch service, which efficiently runs a huge amount of computing jobs while optimizing compute resources. At the same time, Vulture is implemented as a Docker container and can be easily run on local servers or other cloud platforms. We applied Nextflow (RRID:SCR_024135), a workflow management language, to deploy complex parallel workflows of containers on clouds and clusters. The cloud architecture of Vulture is described in the Methods section and Supplementary Fig. S1b. Through AWS Batch and Nextflow, users can run thousands of viral calling tasks for public scRNA-seq data in parallel with simple configuration inputs.

Runtime performance and cost-effectiveness of Vulture on the cloud
To validate and benchmark the scalability of Vulture on the cloud, we tested it on the public COVID-19 scRNA-seq data by Bost et al. [14] consisting of up to 400 individual fastq files. Execution duration (pastel) and vCPU time (saturated) from retrieving files to bam analysis when running 25 to 200 parallel tasks are recorded in Fig. 2A. Vulture analysis on 200 fastq files finished within 48 minutes, reaching a speed-up (compared to running 200 single tasks sequentially) of 155×, showing that it is highly scalable. We also compared Vulture to another cloud-based tool, Cumulus [18], in read mapping on an identical prebuilt host-microbe genome because Cumulus did not natively support viral calling tasks. Figure 2A indicates that Vulture outperformed Cumulus nearly 2-fold in total duration (saturated). Costing $12, Vulture runs 200 alignment tasks in 20 minutes, while Cumulus needs $69 to run 200 samples in 32 minutes. Vulture takes advantage of the AWS Batch spot capacity optimization technique. Its utilization of spot instances reduced the cost of running viral calling pipelines. Also, it ensured the availability of computational resource allocation, maximizing the number of concurrent tasks and minimizing the response time of pending tasks.
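The cost comparison can be checked with a few lines of Python. The dollar amounts and the number of tasks are the ones reported above; the helper name `relative_reduction` is introduced only for this sketch.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` is smaller than `baseline`."""
    return (baseline - new) / baseline


n_tasks = 200
vulture_cost_usd = 12.0   # reported cost of 200 alignment tasks with Vulture
cumulus_cost_usd = 69.0   # reported cost of 200 samples with Cumulus

cost_reduction = relative_reduction(cumulus_cost_usd, vulture_cost_usd)
vulture_per_sample = vulture_cost_usd / n_tasks
cumulus_per_sample = cumulus_cost_usd / n_tasks

print(f"cost reduction: {cost_reduction:.1%}")
print(f"per-sample cost: ${vulture_per_sample:.3f} vs ${cumulus_per_sample:.3f}")
```

With these figures the reduction rounds to the 83% quoted in the abstract.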

Performance of Vulture on the local environment against off-the-shelf tools
We also tested the local command-line version of Vulture against Venus and PathogenTrack [15, 16] using 3 COVID-19 scRNA-seq samples with different sizes: a 1-GB small sample (SRR12570205) from Bost et al. [14], a 67-GB medium sample (SRR11537951), and a 141-GB large sample (SRR11181956) from Liao et al. [33]. Fig. 2B indicates that Vulture is the most computationally efficient method among the three. On medium and large samples, Vulture took 83 and 187 minutes to finish the analysis. It was 2- to 3-fold faster than Venus (463 and 538 minutes, respectively) and 9- to 10-fold faster than PathogenTrack (832 and 2,214 minutes, respectively). In addition, we tested the consistency among methods on the 3 datasets by measuring the intersection of SARS-CoV-2-positive cells in Fig. 2D. The result of Vulture before filtering the empty droplets [28], named "Vulture (unfiltered)," is also added to the comparison. Since Vulture (unfiltered) is a superset of the Vulture result, the intersection between the two is the Vulture set in the figure. Vulture (unfiltered) identified SARS-CoV-2-positive cells (Fig. 2D) that cover most of the cells identified by the other tools (Supplementary Fig. S2a). It is the quality control step that filtered out many empty droplets. On the 135 cells in the intersection of the three tools, we calculated the mean absolute error (MAE) and Pearson correlation of SARS-CoV-2 viral UMI counts across methods (Supplementary Fig. S2c and d). The MAEs between Vulture, Venus, and PathogenTrack are smaller than 1, showing that Vulture generated microbial calling results consistent with state-of-the-art methods while running faster.
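The consistency measures can be sketched in Python. The barcodes and UMI counts below are invented toy values, not results from the study, and the function names are hypothetical; the sketch only shows how MAE and Pearson correlation would be computed on the cells shared between two tools.

```python
import math
from typing import Dict, List, Sequence


def shared_cells(*calls: Dict[str, int]) -> List[str]:
    """Barcodes reported as virus-positive by every tool."""
    return sorted(set.intersection(*(set(c) for c in calls)))


def mean_absolute_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean of the absolute per-cell differences in UMI counts."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Toy viral UMI counts per cell barcode for two hypothetical tools.
tool_a = {"AAAC": 5, "AAAG": 2, "AACT": 9, "ACGT": 1}
tool_b = {"AAAC": 5, "AAAG": 3, "AACT": 8, "TTTT": 4}

common = shared_cells(tool_a, tool_b)
counts_a = [tool_a[bc] for bc in common]
counts_b = [tool_b[bc] for bc in common]
mae = mean_absolute_error(counts_a, counts_b)
r = pearson(counts_a, counts_b)
```

Restricting the comparison to shared barcodes mirrors the use of the 135 intersecting cells in the text.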

Vulture enables cloud-based discovery of metapneumovirus reads in COVID-19 BALF samples
SARS-CoV-2 infection has been identified as the source of the worldwide COVID-19 pandemic since 2019. Many aspects of the viral-host interaction remain unknown. There is a major interest in identifying coinfection with other pathogens in patients with COVID-19. Therefore, we applied Vulture on the cloud to BALF samples from the SRA to call viruses. We performed a meta-analysis on the Liao et al. [33] cohort from China (SRP250732) and the Bost et al. [14] cohort (SRP279746) from Israel. We ran a downstream analysis on BALF samples for the 2 cohorts, totaling 51,338 and 991,722 cells after all QC filtering, respectively. After preprocessing and clustering, the cell types were defined based on marker genes from Bost et al. and Liao et al. (Supplementary Fig. S3b and c). Given that SARS-CoV-2 UMIs in scRNA-seq data are relatively low and imbalanced, a statistical test (see Methods) [19] is performed to estimate cell type-specific enrichment of SARS-CoV-2 infection. The combined microbe-host genome in Vulture includes a comprehensive set of human-host microbes to identify coinfections or unexpected microbes. Vulture revealed human metapneumovirus (hMPV) coinfection with SARS-CoV-2 in the Liao et al. [33] cohort (SRP250732) and unexpected herpes simplex viruses (HSV) in the Bost et al. [14] cohort (SRP279746), consistent with previous findings [14]. UMIs for different viral transcripts are in Fig. 3A and B. Cell-type visualization of BALF cells in the Liao cohort is in Fig. 3C, with SARS-CoV-2 and hMPV presence in Fig. 3D and E. UMAP plots showing the distribution of SARS-CoV-2 and HSV for the Bost cohort are presented in Supplementary Fig. S3a. Statistical tests found SARS-CoV-2 enriched (P < 0.05) in epithelial cells, neutrophils, and plasma B cells (Fig. 3D and Supplementary Table S1), as well as hMPV enriched in CD8+ T cells, natural killer cells, macrophages, and monocytes (Fig. 3E and Supplementary Table S3). Fig. 3E shows a separate monocyte subtype with hMPV infections. We therefore compared gene expression between the hMPV-enriched monocytes/macrophages and the hMPV-negative monocytes/macrophages. Differentially expressed genes for the virally infected subtype are shown in Supplementary Table S7. S100A8/S100A9 were upregulated in hMPV-enriched macrophages/monocytes and are involved in neutrophil-related inflammation [34]. FCN1, which encodes a complement cascade member, was upregulated [3]. IDO1 was also upregulated; murine coronavirus infection has been shown to activate AhR independently, affecting cytokines [35]. g:Profiler [36] identified functional enrichment of the top 100 upregulated genes in hMPV-enriched subtypes. The results in Fig. 3F indicate that the expression patterns of these subtypes drive innate immune and cytokine responses. Interferon-gamma (IFN-γ) response activates IFN response in alveolar macrophages, recruits monocyte-derived alveolar macrophages, and forms an inflammatory signaling circuit [34].

Cloud-based meta-analysis reveals an HBV-associated CNV signature in HCC
Another advantage of having a cloud-based framework is that it facilitates the integration of multiple datasets that are already in the same repository on the cloud. We performed a meta-analysis of the 3 public HCC cohorts with droplet scRNA-seq sequencing data, SRP278381, SRP136347, and SRP318499. We ran Vulture on the HCC samples of 24 patients in these cohorts, totaling 421,780 cells following all QC filtering. After clustering and integration, the cell types were defined based on marker genes from Sharma et al. [40] (SRP278381) (Supplementary Fig. S5c). Microbial enrichment detection (Supplementary Table S4) on each of the cohorts indicated that hepatocytes are the only cell type with HBV enrichment (Supplementary Fig. S5b).
To further delineate the HBV enrichment within hepatocytes, we reclustered and reintegrated the hepatocytes of 11 of 24 patient samples that contained any HBV expression (Fig. 4A) and found that the only HBV-enriched subclusters were 0 and 3 (Supplementary Table S5), with both subclusters enriched in 2 of 3 cohorts (Fig. 4B). We analyzed the CNV of each patient using inferCNV with the macrophages as reference cells and hepatocytes as observation cells (Supplementary Figs. S6-S8); we noted that, in general, the CNV clones with more HBV expression mostly consisted of subclusters 0 and 3, while the clones with less HBV expression consisted mainly of subcluster 2. We picked 3 representative patients (from 2 cohorts) with well-defined CNV clones and a sufficient (>50) number of HBV-positive cells, P114_SRP318499, P725_SRP318499, and P7_SRP278381, and generated an overall CNV for this set (Fig. 4C). The result shows that for patients in different cohorts, the clones with discernible, less ambiguous CNV patterns have a clear majority of cells with HBV expression compared to the other clones (Fig. 4C, green boxes).
We studied the cell-cell interaction (CCI) pattern of HBV-hepatocyte interactions and further grouped hepatocytes into HBV-positive (+) and normal (not HBV-positive) hepatocytes. CCIs in HBV-positive subclusters 0 and 3 had higher relative strength than normal ones, as shown in Fig. 4D. The proteinase-activated receptor (PAR) signaling pathway was found to be the most significant signaling pathway (Fig. 4D and E) across the HBV-enriched hepatocytes. PARs, as thrombin receptors, are involved in thrombin-induced cell migration across a collagen transmembrane barrier [41]. Midkine (MK) is a growth factor that has been shown to play a crucial role in HCC. It is involved in inflammatory responses, acts as an antiapoptotic factor, and blocks anoikis to promote metastasis [42].

Identification of H. pylori reads in gastric cancer
We also conducted a meta-analysis of the 2 gastric cancer (GC) cohorts that have publicly available droplet scRNA-seq sequencing data, SRP215370 (Zhang et al. [11]) and SRP261119 (Kim et al. [12]). The former includes early GC samples (of which we only used the 2 confirmed H. pylori-positive patients), and the latter contains GC patient samples, the majority of which were known to be H. pylori positive. In total, we applied Vulture to the early GC or GC samples of 15 patients in the 2 cohorts, amounting to 125,845 cells after all QC filtering. We used the same approach as in the above HCC case study for clustering and integration and labeled the cell types in line with Kim et al. [12] (SRP261119) (Fig. 5A). Microbial enrichment detection by cohort showed H. pylori enrichment in endothelial cells, fibroblasts, and macrophages in SRP215370 and enrichment only in pit mucous cells in SRP261119 (Fig. 5C and Supplementary Table S6). The difference is also present in the H. pylori virulence of the 2 cohorts shown in Fig. 5B. The cagA virulence gene is only detected in the Zhang et al. [11] cohort; cagA-positive strains are the strongest risk factor of gastric cancer [43].

Discussion
In recent years, the rapidly growing single-cell study sources have become a gold mine for the reinvestigation of host-microbial interactions. But neither traditional practices in scRNA-seq studies nor the bioinformatics methods developed to detect microbes are capable of scalable meta-analysis of the public open data on the cloud. For such analyses, cloud computing has become essential. Here, we present Vulture, a cloud-based scalable framework for calling microbial RNA on public scRNA-seq resources. Vulture was benchmarked on data originating from various tissues, generated with different scRNA-seq platforms, and deposited in different formats. It proved highly scalable and cost-effective: it runs 200 analyses within a duration similar to that of a single task and at low cost. Moreover, running a single Vulture analysis in the local environment is substantially faster than previous methods. The reason is that Vulture provides an easily customizable combined reference. It performs read mapping once, while others map reads to the host and then align unmapped reads to the microbe references, among other intermediate steps. Vulture runs faster by getting rid of complex preprocessing on unmapped reads. The combined reference is 30% larger than the host genome, but indexing from alignment tools can compensate for the increased reference size. Also, Vulture is user-friendly because it supports multiple platforms and multiple input formats. We demonstrated that Vulture can readily provide an effective solution for viral calling meta-analysis on large-scale public data. We applied Vulture and scRNA-seq analysis to public COVID-19 BALF cells, HCC samples, and GC samples. The COVID-19 analysis revealed that Vulture can identify coinfections of unexpected pathogens. The HCC analysis uncovered a potentially crucial relationship between CNV and intracellular HBV. All those cases indicate that Vulture is highly valuable for studying unknown mechanisms and treatments by mining large-scale single-cell data.
However, there are several limitations of Vulture. A key hindrance to Vulture meta-analysis is data access permission. Several atlas-level databases have strict access permissions, which makes it hard to run cloud-based analysis on a large scale. Also, large-scale metadata cleansing for raw sequencing files, which is essential to run parallel meta-analysis appropriately, is difficult because sequencing files are generated by different protocols. A prospective solution is to incorporate biomedical natural language processing (bioNLP) models and search engine technologies to extract metadata for Vulture. Ultimately, this would allow Vulture to automatically dig the gold mine of host-microbial interactions in big data.
In summary, Vulture is a cloud-based scalable framework for calling microbial RNA on public scRNA-seq resources. It is highly scalable, is cost-effective on the cloud, and substantially outperforms previous methods in local environments. We anticipate that Vulture will play a crucial role in the attempt to understand the unrevealed genetics of pathogenic diseases as the community gradually contributes to the increasing scale of single-cell data for host-microbial interactions.

Figure 1: Schematic diagram of Vulture for scRNA-seq microbial calling on the cloud. Vulture is a containerized computational framework composed of 5 procedures. The 5 steps include (i) multiformat scRNA-seq data and microbial genome retrieval, (ii) custom combined reference construction, (iii) reads alignment, (iv) quality control, and (v) downstream analysis. All procedures in the Vulture architecture run as containerized applications on the Amazon Batch service.

Figure 2: Performance benchmark of Vulture. (A) Performance of the Vulture pipeline when running 25 to 200 parallel analyses. The performance is measured by execution duration (pastel) and vCPU time (saturated) with the respective number of parallel tasks. The time of each of the 4 steps in the Vulture pipeline is displayed separately. The line plot is the time for Cumulus to run 25 to 200 parallel read alignment tasks in comparison to the mapping step of Vulture. Total time (star-marked) and individual task durations (circle-marked) of different numbers of parallel runs processed by the 2 pipelines. (B) Task duration comparison among viral calling methods to run a single analysis with different input file sizes. The local version of Vulture is composed of 3 steps (Map, Analysis, and Filter), PathogenTrack is composed of 2 steps (Map and Count), and Venus is end-to-end. The total duration of tasks is presented in stacked bar plots. (C) Cost (in US dollars) of the computation resource needed to run 200 analyses on cloud platforms of Vulture and Cumulus. (D) Consistency among viral calling methods. The consistency is estimated by measuring the intersections of virus-positive cells annotated by different tools. Vulture results and results before the filtering step are shown separately.

Figure 3: Viral calling meta-analysis results on the COVID-19 BALF samples. (A, B) Transcript UMIs of 3 major detected viruses (SARS-CoV-2, hMPV, and HSV) from SRP250732 and SRP279746, respectively. (C) UMAP plot of the COVID-19 BALF data; cells are colored by cell-type annotations. (D, E) UMAP plots of the SARS-CoV-2 and hMPV infection, respectively. Infected cells are colored orange while other cells are gray. (F) The top 15 enriched Gene Ontology (GO) terms identified by functional enrichment analysis. (G) Cell-cell interaction (CCI) strengths across different cells grouped by cell types along with viral infections. (H) Enriched signaling pathways identified by CellChat. (I) The CCI among cell types through the CCL signaling pathway network.

Figure 5: Viral calling meta-analysis results on the GC sample. (A) UMAP plot of the GC cells colored by cell-type annotations. (B) The transcript UMIs of H. pylori identified in SRP215370 and SRP261119, separately. (C) UMAP plots of the intracellular H. pylori. H. pylori-positive cells are colored purple while other cells are gray.

Table 1: Overview of the datasets processed in this study.